Countries GDP Prediction - Random Forest (RF) - Regression
Demo
Live Web App Demo Link
Deployment on Heroku: https://mlgdp.herokuapp.com/
Deployment on Streamlit: https://share.streamlit.io/monicadesai-tech/project_78/main/app.py
Abstract
The purpose of this report is to use the countries of the world.csv dataset to predict the GDP of a given country from details provided by the user.
This can be used to gain insight into why a country's GDP is at a given level. It can also be used for marketing advantage, for example by targeting advertisement at countries with higher GDP (owing to certain laws or other reasons), or at developing nations, considering their impact on and relations with other countries. Countries GDP Prediction is a regression problem: given various parameters as input from the user, the model supplies a predicted result.
This project dives into Countries GDP Prediction through machine learning concepts. "End to End Project" means a step-by-step process: it starts with data collection, EDA, and data preparation (which includes cleaning and transforming), then selecting, training, and saving ML models, cross-validation and hyper-parameter tuning, then developing a web service, and finally deployment so end users can use it anytime and anywhere. This repository contains the code for Countries GDP Prediction using various Python libraries.
It uses the numpy, pandas, matplotlib, seaborn, sklearn, and streamlit libraries, each of which provides one particular piece of functionality. NumPy (Numerical Python) is used for working with arrays; pandas objects rely heavily on NumPy objects. Matplotlib is a plotting library, and seaborn is a data visualization library based on matplotlib. Scikit-learn provides a wide range of ready-made models. Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning. The purpose of creating this repository is to gain insight into a complete ML project; using these libraries in practice deepened my knowledge of them and grew my ML repository. The screenshots above and the video in the Video_File folder will help you understand the flow of the output.
Motivation
The reason behind building this project is that I personally like to travel and explore new places, and at the same time to gain knowledge about each country: its culture, language, currency, and government. These indicate the power, unity, and discipline of a country's citizens, and the reasons such qualities may be lacking in other countries. So I created the Countries GDP Prediction project to gain insight from an IT perspective into how such a model works. Hence, I continue to spread my tech wings in IT.
Acknowledgment
Dataset Available: https://www.kaggle.com/fernandol/countries-of-the-world
The Data
It has 20 columns, with three being numeric and the rest read in as categorical (object) types.
Renaming the columns.
It displays the number of missing/null values in each column.
Analysis of Data
Let’s start by doing a general analysis of the data as a whole.
Basic Statistics
Data validation checks that data are valid, sensible, reasonable, and secure before they are processed.
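As a small illustration of such checks, here is a hedged sketch of range validation in pandas; the values are made up, and the column names are borrowed from the renaming step later in this report:

```python
import pandas as pd

# Made-up rows for illustration; column names follow this project's renaming step
data = pd.DataFrame({'population': [38_000_000, -5, 1_200_000],
                     'literacy':   [99.0, 87.5, 150.0]})

# Sensible-range checks: population must be positive, literacy within 0-100
valid_pop = data['population'] > 0
valid_lit = data['literacy'].between(0, 100)
print((valid_pop & valid_lit).sum(), 'of', len(data), 'rows pass')  # → 1 of 3 rows pass
```

Rows that fail such checks can then be inspected, corrected, or imputed before modelling, as done in the Data Cleaning step below.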
Graphing of Features
Graph Set 1
We can see from the above that we have some missing data points, but not extensively: 14 of our 20 columns have missing data, and the highest percentage of missing data is in the 'Climate' column, at less than 10% (22/227).
Graph Set 2
Some insights from the above correlation heatmap:
1. Expected strong correlation between infant_mortality and birthrate.
2. Unexpected strong correlation between infant_mortality and agriculture.
3. Expected strong correlation between infant_mortality and literacy.
4. Expected strong correlation between gdp_per_capita and phones.
5. Expected strong correlation between arable and other (land other than crops).
6. Expected strong correlation between birthrate and literacy (the lower the literacy, the higher the birthrate).
7. Unexpected strong correlation between birthrate and phones.
Graph Set 3
We can see a fair correlation between GDP and migration, which makes sense, since migrants tend to move to countries with better opportunities and higher GDP per capita.
Graph Set 4
From the above figures, we can notice the following:
1. Sub-Saharan Africa and Latin America are the regions with the most countries.
2. Western Europe and North America have the highest GDP per capita, while Sub-Saharan Africa has the lowest.
3. Asia, North America, and Northern Europe are the main regions where migrants from other regions go.
4. Asia has the largest population; Oceania has the smallest.
Graph Set 5
The figure below shows the regional ranking according to average GDP per capita. As expected, North America and Western Europe have the highest GDP per capita, while Sub-Saharan Africa has the lowest, and that may explain the large migration trends seen in the world over the past decade.
Graph Set 6
From the above figure, it is clear that the higher the country's GDP, the more literate the population is, and vice-versa.
Modelling
The purpose of these models is to get effective insight into the following:
1. How the GDP of a country depends on various factors:
• This insight can be used for market targeting.
2. How the RMSE of the predictions affects decisions such as:
• Spending more money to target the researchers/political figures most likely to sustain innovation and strict enforcement of laws, versus spending less money on education and basic health-care infrastructure, or on other factors responsible for GDP.
Where to Use RF Regression: RF Regression Example
Let’s say you want to estimate the average household income in your town. You could find an estimate using the Random Forest algorithm. You would start by distributing surveys asking people to answer a number of different questions. Depending on how they answered, a decision tree would generate an estimated household income for each person. After building decision trees from multiple people's responses, you would apply the Random Forest algorithm to this data: look at the result of each decision tree and average them to obtain the final income estimate. Applying this algorithm would provide you with an accurate estimate of the average household income of the people you surveyed.
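The averaging idea in the example above can be sketched with scikit-learn's RandomForestRegressor on synthetic data (the features and incomes below are made up for illustration): the forest's prediction is simply the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(42)
# Hypothetical survey answers (e.g. age, years of education, hours worked), scaled to [0, 1]
X = rng.rand(200, 3)
# Synthetic "household income" driven mostly by the second feature, plus noise
y = 30_000 + 50_000 * X[:, 1] + 5_000 * rng.randn(200)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

sample = X[:1]
# Each fitted tree gives its own estimate; the forest averages them
tree_preds = [tree.predict(sample)[0] for tree in rf.estimators_]
assert np.isclose(rf.predict(sample)[0], np.mean(tree_preds))
```

The assertion at the end checks the point of the example: the ensemble prediction equals the average over the 100 trees.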
Math behind the metrics
MAE (Mean Absolute Error) represents the difference between the original and predicted values, obtained by averaging the absolute differences over the data set.
RMSE (Root Mean Squared Error) is the square root of the MSE (Mean Squared Error).
R-squared (coefficient of determination) represents how well the predicted values fit the original values. It ranges from 0 to 1 and can be interpreted as a percentage; the higher the value, the better the model.
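These three metrics can be computed with the same sklearn.metrics functions used later in this report; the y_true/y_pred values below are illustrative only, not results from this project's models:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = metrics.mean_absolute_error(y_true, y_pred)            # mean of |y - y_hat|
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))   # sqrt of mean of (y - y_hat)^2
r2 = metrics.r2_score(y_true, y_pred)                        # 1 - SS_res / SS_tot
print(mae, rmse, r2)
```

All three functions take the true values and the predicted values as their two array arguments.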
Algorithm for Random Forest Regression:
GridSearchCV Vs RandomSearchCV:
Algorithm for Gradient Boosting Regression:
Model Architecture Process Through Visualization
Quick Notes
Step 1: Imported required libraries.
Step 2: Read the Data.
Step 3: Analysed the data.
Step 4: Performed Data Cleaning.
Step 5: Performed EDA.
Step 6: Performed Data Pre-conditioning.
Step 7: Performed Model Building, Prediction, Evaluation and Optimization.
- Linear Regression
- SVM
- Random Forest
- Gradient Boosting
Step 8: Created Web App.
The Model Analysis
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
Imported required libraries – When we import modules, we are able to call functions that are not built into Python. Some modules are installed as part of Python, and some we install through pip. Making use of modules allows us to make our programs more robust and powerful, as we are leveraging existing code.
data = pd.read_csv('countries of the world.csv')
Read the Data – You can import tabular data from a CSV file into a pandas DataFrame by passing the file path to pd.read_csv.
data.head(3)
data.info()
data.columns = (["country","region","population","area","density","coastline_area_ratio","net_migration","infant_mortality","gdp_per_capita",
"literacy","phones","arable","crops","other","climate","birthrate","deathrate","agriculture","industry",
"service"])
data.country = data.country.astype('category')
data.region = data.region.astype('category')
num_cols = ['density', 'coastline_area_ratio', 'net_migration', 'infant_mortality', 'literacy',
            'phones', 'arable', 'crops', 'other', 'climate', 'birthrate', 'deathrate',
            'agriculture', 'industry', 'service']
for col in num_cols:
    # These columns use ',' as the decimal separator; normalize and convert to float
    data[col] = data[col].astype(str).str.replace(",", ".").astype(float)
data.info()
data.describe()
print(data.isnull().sum())
sns.heatmap(data.isnull()).set(title = 'Missing Data', xlabel = 'Columns', ylabel = 'Data Points')
data.loc[[27,51, 101, 118, 219], ['country', 'population', 'area', 'coastline_area_ratio', 'gdp_per_capita']]
data.loc[:, ['country', 'region', 'climate', 'agriculture', 'industry', 'service']].head()
data.climate.unique()
h1 = data.loc[:, ['country', 'region', 'climate']][data.climate == 1].head()
h2 = data.loc[:, ['country', 'region', 'climate']][data.climate == 2].head()
h3 = data.loc[:, ['country', 'region', 'climate']][data.climate == 3].head()
h4 = data.loc[:, ['country', 'region', 'climate']][data.climate == 4].head()
h5 = data.loc[:, ['country', 'region', 'climate']][data.climate == 1.5].head()
h6 = data.loc[:, ['country', 'region', 'climate']][data.climate == 2.5].head()
pd.concat([h1, h2, h3, h4, h5, h6])
Analysed the data – Using the .info(), .astype(), .isnull().sum(), and .unique() commands: checking the number of unique categories in particular columns, and checking for missing values so they can be dealt with.
print(data.isnull().sum())
data['net_migration'].fillna(0, inplace=True)
data['infant_mortality'].fillna(0, inplace=True)
data['gdp_per_capita'].fillna(2500, inplace=True)
data['literacy'].fillna(data.groupby('region')['literacy'].transform('mean'), inplace= True)
data['phones'].fillna(data.groupby('region')['phones'].transform('mean'), inplace= True)
data['arable'].fillna(0, inplace=True)
data['crops'].fillna(0, inplace=True)
data['other'].fillna(0, inplace=True)
data['climate'].fillna(0, inplace=True)
data['birthrate'].fillna(data.groupby('region')['birthrate'].transform('mean'), inplace= True)
data['deathrate'].fillna(data.groupby('region')['deathrate'].transform('mean'), inplace= True)
data['agriculture'].fillna(0.17, inplace=True)
data['service'].fillna(0.8, inplace=True)
data['industry'].fillna((1 - data['agriculture'] - data['service']), inplace= True)
print(data.isnull().sum())
Performed Data Cleaning – Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. In Python, you can drop the missing values, replace them with a scalar value, or fill forward or backward. Here, I checked the number of missing values in each column and imputed each one with either zero, a fixed value, or the regional mean.
fig, ax = plt.subplots(figsize=(16,16))
sns.heatmap(data.corr(), annot=True, ax=ax, cmap='BrBG').set(
title = 'Feature Correlation', xlabel = 'Columns', ylabel = 'Columns')
plt.show()
g = sns.pairplot(data[['population', 'area', 'net_migration', 'gdp_per_capita', 'climate']], hue='climate')
g.fig.suptitle('Feature Relations')
plt.show()
fig = plt.figure(figsize=(18, 24))
plt.title('Regional Analysis')
ax1 = fig.add_subplot(4, 1, 1)
ax2 = fig.add_subplot(4, 1, 2)
ax3 = fig.add_subplot(4, 1, 3)
ax4 = fig.add_subplot(4, 1, 4)
sns.countplot(data= data, y= 'region', ax= ax1, palette='BrBG')
sns.barplot(data= data, y= 'region', x= 'gdp_per_capita', ax= ax2, palette='BrBG', ci= None)
sns.barplot(data= data, y= 'region', x= 'net_migration', ax= ax3, palette='BrBG', ci= None)
sns.barplot(data= data, y= 'region', x= 'population', ax= ax4, palette='BrBG', ci= None)
plt.show()
fig = plt.figure(figsize=(12, 4))
data.groupby('region')['gdp_per_capita'].mean().sort_values().plot(kind='bar', color='coral')
plt.title('Regional Average GDP per Capita')
plt.xlabel("Region")
plt.ylabel('Avg. GDP per Capita')
plt.show()
fig = plt.figure(figsize=(12, 12))
sns.jointplot(data= data, x= 'literacy', y= 'gdp_per_capita', kind= 'hex',color='coral')
plt.title('GDP Analysis: GDP vs Literacy')
plt.show()
fig = plt.figure(figsize=(12, 12))
sns.jointplot(data= data, x= 'arable', y= 'gdp_per_capita', kind= 'hex', color='coral')
plt.title('GDP Analysis: GDP vs Arable Land')
plt.show()
fig = plt.figure(figsize=(12, 12))
sns.jointplot(data= data, x= 'infant_mortality', y= 'gdp_per_capita', kind= 'hex',color='coral')
plt.title('GDP Analysis: GDP vs Infant Mortality Rate')
plt.show()
Performed EDA – The primary goal of EDA is to maximize the analyst's insight into a data set and its underlying structure, while providing the specific items an analyst wants to extract, such as a good-fitting, parsimonious model and a list of outliers. Python libraries like pandas, NumPy, matplotlib, and seaborn support EDA. The four types of EDA are univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical.
First, I plotted the correlation heatmap. Insights from it: an expected strong correlation between infant_mortality and birthrate, and an unexpected strong correlation between birthrate and phones. A correlation heatmap uses colored cells, typically in a monochromatic scale, to show a 2D correlation matrix between two discrete dimensions. Correlation ranges from -1 to +1; values closer to zero mean there is no linear trend between the two variables, and the closer the correlation is to 1, the more positively correlated they are: as one increases so does the other, and the closer to 1, the stronger the relationship.
Secondly, I plotted a pairplot. The insight from it is a fair correlation between GDP and migration, which makes sense, since migrants tend to move to countries with better opportunities and higher GDP per capita. The pairplot function creates a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column, showing the relationship for every pairwise combination of variables in the DataFrame as a matrix of plots, with univariate plots on the diagonal.
Third, I plotted bar plots for the regional analysis. The insight from them is that Western Europe and North America have the highest GDP per capita, while Sub-Saharan Africa has the lowest. A barplot shows comparisons among discrete categories.
Fourth, I plotted joint plots. The insights from them are that the higher a country's GDP, the more literate its population, and that poorer countries suffer more from infant mortality. A joint plot displays the relationship between two variables (bivariate) as well as their 1D profiles (univariate) in the margins.
data_final = pd.concat([data,pd.get_dummies(data['region'], prefix='region')], axis=1).drop(['region'],axis=1)
print(data_final.info())
data_final.head()
y = data_final['gdp_per_capita']
X = data_final.drop(['gdp_per_capita','country'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
sc_X = StandardScaler()
X2_train = sc_X.fit_transform(X_train)
X2_test = sc_X.transform(X_test)  # only transform: reuse the scaler fitted on the training set
y2_train = y_train
y2_test = y_test
y3 = y
X3 = data_final.drop(['gdp_per_capita','country','population', 'area', 'coastline_area_ratio', 'arable',
'crops', 'other', 'climate', 'deathrate', 'industry'], axis=1)
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=101)
sc_X4 = StandardScaler()
X4_train = sc_X4.fit_transform(X3_train)
X4_test = sc_X4.transform(X3_test)  # only transform: reuse the scaler fitted on the training set
y4_train = y3_train
y4_test = y3_test
Performed Data Pre-conditioning – Here, I am making the data ready for training. pd.get_dummies converts categorical string values into dummy variables: it creates a new data frame whose columns are the unique values, filled with zeros and ones. First, I transformed the 'region' column into numerical values. Secondly, I split the data set into training and testing parts (80/20), dropping the country column (a string, not going to be used to train the models) and separating the gdp_per_capita column to be used as labels; this allows prediction on X_test after training. Thirdly, I created variants of the split data: with/without feature selection and with/without feature scaling. fit_transform() means to do some calculation and then do the transformation (say, calculating the means of columns and then replacing missing values); for the training set you need to both fit and transform, while for the test set you should only transform, using the parameters learned on the training set. StandardScaler() transforms the data so that it has mean 0 and standard deviation 1; in short, it standardizes the data into a standard normal distribution, which is also useful for data with negative values.
##Linear Regression
lm1 = LinearRegression()
lm1.fit(X_train,y_train)
lm2 = LinearRegression()
lm2.fit(X2_train,y2_train)
lm3 = LinearRegression()
lm3.fit(X3_train,y3_train)
lm4 = LinearRegression()
lm4.fit(X4_train,y4_train)
lm1_pred = lm1.predict(X_test)
lm2_pred = lm2.predict(X2_test)
lm3_pred = lm3.predict(X3_test)
lm4_pred = lm4.predict(X4_test)
print('Linear Regression Performance:')
print('\nall features, No scaling:')
print('MAE:', metrics.mean_absolute_error(y_test, lm1_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, lm1_pred)))
print('R2_Score: ', metrics.r2_score(y_test, lm1_pred))
print('\nall features, with scaling:')
print('MAE:', metrics.mean_absolute_error(y2_test, lm2_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y2_test, lm2_pred)))
print('R2_Score: ', metrics.r2_score(y2_test, lm2_pred))
print('\nselected features, No scaling:')
print('MAE:', metrics.mean_absolute_error(y3_test, lm3_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y3_test, lm3_pred)))
print('R2_Score: ', metrics.r2_score(y3_test, lm3_pred))
print('\nselected features, with scaling:')
print('MAE:', metrics.mean_absolute_error(y4_test, lm4_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y4_test, lm4_pred)))
print('R2_Score: ', metrics.r2_score(y4_test, lm4_pred))
fig = plt.figure(figsize=(12, 6))
plt.scatter(y4_test,lm4_pred,color='coral', linewidths=2, edgecolors='k')
plt.xlabel('True GDP per Capita')
plt.ylabel('Predictions')
plt.title('Linear Regression Prediction Performance (features selected and scaled)')
plt.grid()
plt.show()
## SVR
svm1 = SVR(kernel='rbf')
svm1.fit(X_train,y_train)
svm2 = SVR(kernel='rbf')
svm2.fit(X2_train,y2_train)
svm3 = SVR(kernel='rbf')
svm3.fit(X3_train,y3_train)
svm4 = SVR(kernel='rbf')
svm4.fit(X4_train,y4_train)
svm1_pred = svm1.predict(X_test)
svm2_pred = svm2.predict(X2_test)
svm3_pred = svm3.predict(X3_test)
svm4_pred = svm4.predict(X4_test)
print('SVM Performance:')
print('\nall features, No scaling:')
print('MAE:', metrics.mean_absolute_error(y_test, svm1_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, svm1_pred)))
print('R2_Score: ', metrics.r2_score(y_test, svm1_pred))
print('\nall features, with scaling:')
print('MAE:', metrics.mean_absolute_error(y2_test, svm2_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y2_test, svm2_pred)))
print('R2_Score: ', metrics.r2_score(y2_test, svm2_pred))
print('\nselected features, No scaling:')
print('MAE:', metrics.mean_absolute_error(y3_test, svm3_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y3_test, svm3_pred)))
print('R2_Score: ', metrics.r2_score(y3_test, svm3_pred))
print('\nselected features, with scaling:')
print('MAE:', metrics.mean_absolute_error(y4_test, svm4_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y4_test, svm4_pred)))
print('R2_Score: ', metrics.r2_score(y4_test, svm4_pred))
fig = plt.figure(figsize=(12, 6))
plt.scatter(y3_test,svm3_pred,color='coral', linewidths=2, edgecolors='k')
plt.xlabel('True GDP per Capita')
plt.ylabel('Predictions')
plt.title('Unoptimized SVM prediction Performance (with feature selection, and scaling)')
plt.grid()
plt.show()
param_grid = {'C': [1, 10, 100], 'gamma': [0.01,0.001,0.0001], 'kernel': ['rbf']}
grid = GridSearchCV(SVR(),param_grid,refit=True,verbose=3)
grid.fit(X4_train,y4_train)
grid.best_params_
grid.best_estimator_
grid_predictions = grid.predict(X4_test)
print('MAE:', metrics.mean_absolute_error(y4_test, grid_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y4_test, grid_predictions)))
print('R2_Score: ', metrics.r2_score(y4_test, grid_predictions))
fig = plt.figure(figsize=(12, 6))
plt.scatter(y4_test,grid_predictions,color='coral', linewidths=2, edgecolors='k')
plt.xlabel('True GDP per Capita')
plt.ylabel('Predictions')
plt.title('Optimized SVM prediction Performance (with feature selection, and scaling)')
plt.grid()
plt.show()
## Random Forest
rf1 = RandomForestRegressor(random_state=101, n_estimators=200)
rf3 = RandomForestRegressor(random_state=101, n_estimators=200)
rf1.fit(X_train, y_train)
rf3.fit(X3_train, y3_train)
rf1_pred = rf1.predict(X_test)
rf3_pred = rf3.predict(X3_test)
print('Random Forest Performance:')
print('\nall features, No scaling:')
print('MAE:', metrics.mean_absolute_error(y_test, rf1_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, rf1_pred)))
print('R2_Score: ', metrics.r2_score(y_test, rf1_pred))
print('\nselected features, No scaling:')
print('MAE:', metrics.mean_absolute_error(y3_test, rf3_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y3_test, rf3_pred)))
print('R2_Score: ', metrics.r2_score(y3_test, rf3_pred))
fig = plt.figure(figsize=(12, 6))
plt.scatter(y_test,rf1_pred,color='coral', linewidths=2, edgecolors='k')
plt.xlabel('True GDP per Capita')
plt.ylabel('Predictions')
plt.title('Random Forest prediction Performance (No feature selection)')
plt.grid()
plt.show()
rf_param_grid = {'max_features': ['sqrt', 'auto'],
'min_samples_leaf': [1, 3, 5],
'n_estimators': [100, 500, 1000],
'bootstrap': [False, True]}
rf_grid = GridSearchCV(estimator= RandomForestRegressor(), param_grid = rf_param_grid, n_jobs=-1, verbose=0)
rf_grid.fit(X_train,y_train)
rf_grid.best_params_
rf_grid.best_estimator_
rf_grid_predictions = rf_grid.predict(X_test)
print('MAE:', metrics.mean_absolute_error(y_test, rf_grid_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, rf_grid_predictions)))
print('R2_Score: ', metrics.r2_score(y_test, rf_grid_predictions))
fig = plt.figure(figsize=(12, 6))
plt.scatter(y_test,rf_grid_predictions,color='coral', linewidths=2, edgecolors='k')
plt.xlabel('True GDP per Capita')
plt.ylabel('Predictions')
plt.title('Optimized Random Forest prediction Performance (No feature selection)')
plt.grid()
plt.show()
## Gradient Boosting
gbm1 = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100, min_samples_split=2, min_samples_leaf=1, max_depth=3,
subsample=1.0, max_features= None, random_state=101)
gbm3 = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100, min_samples_split=2, min_samples_leaf=1, max_depth=3,
subsample=1.0, max_features= None, random_state=101)
gbm1.fit(X_train, y_train)
gbm3.fit(X3_train, y3_train)
gbm1_pred = gbm1.predict(X_test)
gbm3_pred = gbm3.predict(X3_test)
print('Gradient Boosting Performance:')
print('\nall features, No scaling:')
print('MAE:', metrics.mean_absolute_error(y_test, gbm1_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, gbm1_pred)))
print('R2_Score: ', metrics.r2_score(y_test, gbm1_pred))
print('\nselected features, No scaling:')
print('MAE:', metrics.mean_absolute_error(y3_test, gbm3_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y3_test, gbm3_pred)))
print('R2_Score: ', metrics.r2_score(y3_test, gbm3_pred))
fig = plt.figure(figsize=(12, 6))
plt.scatter(y_test,gbm1_pred,color='coral', linewidths=2, edgecolors='k')
plt.xlabel('True GDP per Capita')
plt.ylabel('Predictions')
plt.title('Gradient Boosting prediction Performance (No feature selection)')
plt.grid()
plt.show()
feat_imp = pd.Series(gbm1.feature_importances_, list(X_train)).sort_values(ascending=False)
fig = plt.figure(figsize=(12, 6))
feat_imp.plot(kind='bar', title='Importance of Features', color= 'coral')
plt.ylabel('Feature Importance Score')
plt.grid()
plt.show()
gbm_param_grid = {'learning_rate':[1,0.1, 0.01, 0.001],
'n_estimators':[100, 500, 1000],
'max_depth':[3, 5, 8],
'subsample':[0.7, 1],
'min_samples_leaf':[1, 20],
'min_samples_split':[10, 20],
'max_features':[4, 7]}
gbm_tuning = GridSearchCV(estimator =GradientBoostingRegressor(random_state=101),
param_grid = gbm_param_grid,
n_jobs=-1,
cv=5)
gbm_tuning.fit(X_train,y_train)
print(gbm_tuning.best_params_)
gbm_grid_predictions = gbm_tuning.predict(X_test)
print('MAE:', metrics.mean_absolute_error(y_test, gbm_grid_predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, gbm_grid_predictions)))
print('R2_Score: ', metrics.r2_score(y_test, gbm_grid_predictions))
fig = plt.figure(figsize=(12, 6))
plt.scatter(y_test,gbm_grid_predictions,color='coral', linewidths=2, edgecolors='k')
plt.xlabel('True GDP per Capita')
plt.ylabel('Predictions')
plt.title('Optimized Gradient Boosting prediction Performance')
plt.grid()
plt.show()
gbm_opt = GradientBoostingRegressor(learning_rate=0.01, n_estimators=500,max_depth=5, min_samples_split=10, min_samples_leaf=1,
subsample=0.7,max_features=7, random_state=101)
gbm_opt.fit(X_train,y_train)
feat_imp2 = pd.Series(gbm_opt.feature_importances_, list(X_train)).sort_values(ascending=False)
fig = plt.figure(figsize=(12, 6))
feat_imp2.plot(kind='bar', title='Importance of Features (Optimized)', color= 'coral')
plt.ylabel('Feature Importance Score')
plt.grid()
plt.show()
Performed Model Building, Prediction, Evaluation and Optimization – The fit() method takes the training data as arguments: one array in the case of unsupervised learning, or two arrays in the case of supervised learning. I then predicted on X_test, which helps in analysing the model's performance, and evaluated the models using the MAE, RMSE, and R2 score error measures, which indicate how well each model is performing.
Optimization is a procedure executed iteratively, comparing various solutions until an optimum or satisfactory solution is found. You can use either GridSearchCV or RandomizedSearchCV. The difference between the two approaches is that in grid search we define the combinations and train the model on each one, whereas RandomizedSearchCV selects the combinations randomly. GridSearchCV loops through predefined hyperparameters and fits your estimator (model) on your training set, so in the end you can select the best parameters from the listed hyperparameters. RandomizedSearchCV uses random combinations of the hyperparameters to find the best solution for the built model; it is similar to grid search, yet it can yield comparable, sometimes better, results at a fraction of the cost.
Gradient boosting identifies difficult observations by the large residuals computed in previous iterations: in gradient boosting, "shortcomings" are identified by gradients. Random Forest works well with both categorical and continuous values, and provides accurate results most of the time. Scikit-learn implements a set of sensible default hyperparameters for all models, but these are not guaranteed to be optimal for a problem. The best hyperparameters are usually impossible to determine ahead of time, and tuning a model is where machine learning turns from a science into trial-and-error engineering.
Hyperparameter tuning relies more on experimental results than on theory, so the best method to determine the optimal settings is to try many different combinations and evaluate the performance of each model. However, evaluating each model only on the training set can lead to one of the most fundamental problems in machine learning: overfitting. If we optimize the model for the training data, it will score very well on the training set but will not be able to generalize to new data, such as a test set. A model that performs highly on the training set but poorly on the test set is overfit: it knows the training set very well but cannot be applied to new problems, like a student who has memorized the simple problems in the textbook but has no idea how to apply the concepts in the messy real world. An overfit model may look impressive on the training set, but will be useless in a real application. Therefore, the standard procedure for hyperparameter optimization accounts for overfitting through cross-validation.
To evaluate prediction error rates and model performance in regression analysis: once you have obtained your error metrics, take note of which features have minimal impact on y; removing some of them may increase your model's accuracy. RMSE is the most popular metric; it is similar to MSE, but square-rooted to make it more interpretable, since it is in the base units, and it is recommended as the primary metric for interpreting your model. MAE is the easiest to understand and represents the average error. All of these metrics take two lists as parameters: the predicted values and the true values.
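The models above are tuned with GridSearchCV. As a hedged sketch of the RandomizedSearchCV alternative discussed in the text, here is a minimal example on synthetic data rather than this project's dataset (the parameter lists below are illustrative, not the project's actual search space):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the real features/labels
X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=101)

# Illustrative search space; RandomizedSearchCV samples n_iter combinations
# from it instead of exhaustively trying all of them like GridSearchCV
param_dist = {'n_estimators': [50, 100, 200],
              'min_samples_leaf': [1, 3, 5],
              'max_features': [0.5, 0.8, 1.0]}

search = RandomizedSearchCV(RandomForestRegressor(random_state=101),
                            param_distributions=param_dist,
                            n_iter=5, cv=3, random_state=101, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

With 27 possible combinations but n_iter=5, only 5 randomly chosen settings are cross-validated, which is what makes the randomized search cheaper than an exhaustive grid.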
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
st.title('Country GDP Estimation Tool')
st.write('''
This app will estimate the GDP per capita for a country, given some
attributes for that specific country as input.
Please fill in the attributes below, then hit the GDP Estimate button
to get the estimate.
''')
st.header('Input Attributes')
att_popl = st.number_input('Population (Example: 7000000)', min_value=1e4, max_value=2e9, value=2e7)
att_area = st.slider('Area (sq. Km)', min_value= 2.0, max_value= 17e6, value=6e5, step=1e4)
att_dens = st.slider('Population Density (per sq. mile)', min_value= 0, max_value= 12000, value=400, step=10)
att_cost = st.slider('Coastline/Area Ratio', min_value= 0, max_value= 800, value=30, step=10)
att_migr = st.slider('Annual Net Migration (migrant(s)/1,000 population)', min_value= -20, max_value= 25, value=0, step=2)
att_mort = st.slider('Infant mortality (per 1000 births)', min_value= 0, max_value=195, value=40, step=10)
att_litr = st.slider('Population literacy Percentage', min_value= 0, max_value= 100, value=80, step=5)
att_phon = st.slider('Phones per 1000', min_value= 0, max_value= 1000, value=250, step=25)
att_arab = st.slider('Arable Land (%)', min_value= 0, max_value= 100, value=25, step=2)
att_crop = st.slider('Crops Land (%)', min_value= 0, max_value= 100, value=5, step=2)
att_othr = st.slider('Other Land (%)', min_value= 0, max_value= 100, value=70, step=2)
st.text('(Arable, Crops, and Other land should add up to 100%)')
att_clim = st.selectbox('Climate', options=(1, 1.5, 2, 2.5, 3))
st.write('''
* 1: Mostly hot (like: Egypt and Australia)
* 1.5: Mostly hot and Tropical (like: China and Cameroon)
* 2: Mostly tropical (like: The Bahamas and Thailand)
* 2.5: Mostly cold and Tropical (like: India)
* 3: Mostly cold (like: Argentina and Belgium)
'''
)
att_brth = st.slider('Annual Birth Rate (births/1,000)', min_value= 7, max_value= 50, value=20, step=2)
att_deth = st.slider('Annual Death Rate (deaths/1,000)', min_value= 2, max_value= 30, value=10, step=2)
att_agrc = st.slider('Agricultural Economy', min_value= 0.0, max_value= 1.0, value=0.15, step=0.05)
att_inds = st.slider('Industrial Economy', min_value= 0.0, max_value= 1.0, value=0.25, step=0.05)
att_serv = st.slider('Services Economy', min_value= 0.0, max_value= 1.0, value=0.60, step=0.05)
st.text('(Agricultural, Industrial, and Services Economy should add up to 1)')
att_regn = st.selectbox('Region', options=(1,2,3,4,5,6,7,8,9,10,11))
st.write('''
* 1: ASIA (EX. NEAR EAST)
* 2: BALTICS
* 3: C.W. OF IND. STATES
* 4: EASTERN EUROPE
* 5: LATIN AMER. & CARIB
* 6: NEAR EAST
* 7: NORTHERN AFRICA
* 8: NORTHERN AMERICA
* 9: OCEANIA
* 10: SUB-SAHARAN AFRICA
* 11: WESTERN EUROPE
'''
)
# One-hot encode the selected region into att_regn_1 .. att_regn_11
(att_regn_1, att_regn_2, att_regn_3, att_regn_4, att_regn_5, att_regn_6,
 att_regn_7, att_regn_8, att_regn_9, att_regn_10, att_regn_11) = [
    1 if att_regn == i else 0 for i in range(1, 12)]
user_input = np.array([att_popl, att_area, att_dens, att_cost, att_migr,
att_mort, att_litr, att_phon, att_arab, att_crop,
att_othr, att_clim, att_brth, att_deth, att_agrc,
att_inds, att_serv, att_regn_1, att_regn_2, att_regn_3,
att_regn_4, att_regn_5, att_regn_6, att_regn_7,
att_regn_8, att_regn_9, att_regn_10, att_regn_11]).reshape(1,-1)
#------
# Model
#------
#import dataset
def get_dataset():
    data = pd.read_csv('countries-of-the-world.csv')
    return data
if st.button('Estimate GDP'):
    data = get_dataset()
    #fix column names
    data.columns = ["country", "region", "population", "area", "density",
                    "coastline_area_ratio", "net_migration", "infant_mortality",
                    "gdp_per_capita", "literacy", "phones", "arable", "crops",
                    "other", "climate", "birthrate", "deathrate", "agriculture",
                    "industry", "service"]
    #Fix data types: categorical columns, and comma-decimal strings to float
    data.country = data.country.astype('category')
    data.region = data.region.astype('category')
    float_cols = ["density", "coastline_area_ratio", "net_migration",
                  "infant_mortality", "literacy", "phones", "arable", "crops",
                  "other", "climate", "birthrate", "deathrate", "agriculture",
                  "industry", "service"]
    for col in float_cols:
        data[col] = data[col].astype(str).str.replace(",", ".").astype(float)
    #fix missing data
    data['net_migration'] = data['net_migration'].fillna(0)
    data['infant_mortality'] = data['infant_mortality'].fillna(0)
    data['gdp_per_capita'] = data['gdp_per_capita'].fillna(2500)
    data['literacy'] = data['literacy'].fillna(data.groupby('region')['literacy'].transform('mean'))
    data['phones'] = data['phones'].fillna(data.groupby('region')['phones'].transform('mean'))
    data['arable'] = data['arable'].fillna(0)
    data['crops'] = data['crops'].fillna(0)
    data['other'] = data['other'].fillna(0)
    data['climate'] = data['climate'].fillna(0)
    data['birthrate'] = data['birthrate'].fillna(data.groupby('region')['birthrate'].transform('mean'))
    data['deathrate'] = data['deathrate'].fillna(data.groupby('region')['deathrate'].transform('mean'))
    data['agriculture'] = data['agriculture'].fillna(0.17)
    data['service'] = data['service'].fillna(0.8)
    data['industry'] = data['industry'].fillna(1 - data['agriculture'] - data['service'])
    #Region Transform: one-hot encode the region column
    data_final = pd.concat([data, pd.get_dummies(data['region'], prefix='region')], axis=1).drop(['region'], axis=1)
    #Data Split
    y = data_final['gdp_per_capita']
    X = data_final.drop(['gdp_per_capita', 'country'], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
    #model training
    gbm_opt = GradientBoostingRegressor(learning_rate=0.01, n_estimators=500,
                                        max_depth=5, min_samples_split=10,
                                        min_samples_leaf=1, subsample=0.7,
                                        max_features=7, random_state=101)
    gbm_opt.fit(X_train, y_train)
    #making a prediction
    gbm_predictions = gbm_opt.predict(user_input)  # user_input is taken from the input attributes
    st.write('The estimated GDP per capita is: ', gbm_predictions)
Created Web App – Built a web app in Streamlit for end-users.
A challenge I faced in this project was that the model's performance was not up to the mark, so I tried various models to find the best one. One can also use the RandomizedSearchCV method to obtain useful results.
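As a minimal sketch of RandomizedSearchCV (the parameter ranges here are illustrative, not the exact ones used in this project, and the data is synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the countries dataset
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=101)

# Candidate hyperparameter values to sample from (illustrative)
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=101),
                            param_distributions=param_dist,
                            n_iter=5, cv=3,
                            scoring='neg_mean_absolute_error',
                            random_state=101)
search.fit(X, y)
print(search.best_params_)
```

Each of the 5 sampled combinations is cross-validated on 3 folds, and the best combination is refit on the full data as `search.best_estimator_`.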
Linear Regression – Base Model
1)Model Training
2)Predictions
3)Model Evaluation
4)Model Visualization
Support Vector Regression – First Model
1)Model Training
2)Predictions
3)Model Evaluation
4)Model Visualization
5)Model Optimization
6)Model Prediction, Evaluation and Visualization after optimization
Random Forest Regression – Second Model
1)Model Training
2)Predictions
3)Model Evaluation
4)Model Visualization
5)Model Optimization
6)Model Prediction, Evaluation and Visualization after optimization
Gradient Boosting Regression – Third Model
1)Model Training
2)Predictions
3)Model Evaluation
4)Model Visualization
5)Feature Importance
6)Model Optimization
7)Model Prediction, Evaluation and Visualization after optimization
8)Feature Importance
Overall Model Analysis
Checking the Model Visualization
Basic Model Evaluation Graphs
Creation of App
Here, I am creating the Streamlit app. It takes input from the user, performs the calculations, checks that the data types match, checks for missing values, transforms the region into one-hot features, splits the data, and makes the final prediction.
Technical Aspect
Numpy is used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. It provides multi-dimensional array and matrix data structures and can be used to perform a wide range of mathematical operations on arrays.
The Pandas module mainly works with tabular data through its DataFrame and Series structures. Element-wise numerical operations in Pandas can be noticeably slower than the equivalent raw Numpy operations, but Pandas is a game changer when it comes to cleaning, transforming, manipulating and analyzing data.
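As a small illustration of this kind of cleaning: the dataset used here stores decimals with commas, which the app converts to floats. A minimal sketch with toy data:

```python
import pandas as pd

# Toy frame mimicking the comma-decimal columns in countries of the world.csv
df = pd.DataFrame({'literacy': ['97,5', '36,0', '88,7']})

# Replace the comma decimal separator with a dot, then cast to float
df['literacy'] = df['literacy'].astype(str).str.replace(',', '.').astype(float)
print(df['literacy'].tolist())  # [97.5, 36.0, 88.7]
```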
Matplotlib is used for EDA. Visualizing data as graphs helps to understand it better than numbers in a table. Matplotlib is mainly used for basic plotting: bars, pies, lines, scatter plots and so on. The inline backend displays visualizations within frontends like Jupyter Notebook, directly below the code cell that produced them.
Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. It provides a variety of visualization patterns and visualize random distributions.
Sklearn (scikit-learn) provides many ML algorithms, offering a range of supervised and unsupervised learning methods through a consistent Python interface.
Why train_test_split is needed: using the same dataset for both training and testing leaves room for miscalculation and increases the chances of inaccurate predictions. The train_test_split function lets you split a dataset with ease while pursuing an ideal model. Also keep in mind that your model should neither overfit nor underfit.
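A minimal sketch of the split on toy data (the same test_size and random_state as in the app):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# Hold out 20% of rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=101)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```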
Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. In just a few minutes you can build and deploy powerful data apps. It is a very easy library for creating a good dashboard in a small amount of time. It also comes with an inbuilt web server and lets you deploy in a Docker container. When you run the app, the localhost server will open in your browser automatically.
‘StandardScaler()’ removes the mean and scales each feature/variable to unit variance. This operation is performed feature-wise in an independent way. ‘StandardScaler()’ can be influenced by outliers (if they exist in the dataset) since it involves the estimation of the empirical mean and standard deviation of each feature.
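A minimal sketch of what StandardScaler does (note that the app itself does not scale its features; this only illustrates the description above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature, four samples
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each feature now has mean 0 and unit variance
print(X_scaled.mean(), X_scaled.std())
```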
"cross_val_score" splits the data into say 5 folds. Then for each fold it fits the data on 4 folds and scores the 5th fold. Then it gives you the 5 scores from which you can calculate a mean and variance for the score. You cross_val to tune parameters and get an estimate of the score.
Installation
Developed on an Intel Core i5 9th-generation machine with an NVIDIA GeForce GTX 1650.
Windows 10 Environment Used.
Already Installed Anaconda Navigator for Python 3.x
The Code is written in Python 3.8.
If you don't have Python installed then please install Anaconda Navigator from its official site.
If you are using an older version of pip, you can upgrade it by running python -m pip install --upgrade pip and pressing Enter.
Run-How to Use-Steps
Keep your internet connection on throughout, while running or accessing files.
Follow this when you want to perform from scratch.
Open Anaconda Prompt, Perform the following steps:
cd <path-to-project-folder>
pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install streamlit
Note: If it shows a 'ModuleNotFoundError', install the relevant module.
You can also create a requirements.txt file with: pip freeze > requirements.txt
Create Virtual Environment:
conda create -n gdp python=3.6
y
conda activate gdp
cd <path-to-project-folder>
Run the .py or .ipynb files.
Paste the URL into your browser to check whether it is working locally.
Follow this when you want to just perform on local machine.
Download ZIP File.
Right-click on the ZIP file in the Downloads section and select the Extract option to unzip it.
Move the unzipped folder to the desired location, be it the D drive, desktop, etc.
Open Anaconda Prompt and write cd followed by the project folder path.
eg: cd C:\Users\Monica\Desktop\Projects\Python Projects 1\23)End_To_End_Projects\Project_10_ML_FileUse_End_To_End_Countries_GDP_Prediction\Project_ML_CountriesGDPPrediction
conda create -n gdp python=3.6
y
conda activate gdp
In Anaconda Prompt, run pip install -r requirements.txt to install all packages.
In Anaconda Prompt, write streamlit run app.py and press Enter.
Paste the URL into your browser (if it does not open automatically) to check whether it is working locally.
Please be careful with spelling and numbers while typing the filename; it is easier to just copy the filename and run that, to avoid silly errors.
Note: for cd, go to the folder where the file is, copy the path from the address bar (right-click and select Copy), and paste it after cd with one space.
Directory Tree-Structure of Project
To Do-Future Scope
Can deploy on AWS and Google Cloud.
Technologies Used-System Requirements-Tech Stack
Conclusion
Modeling
More data cleaning may give better results with GBM.
Analysis
I obtained MAE = 2142.13, RMSE = 3097.19 and R2_Score = 0.88 for the Random Forest model, which is the best among all the models compared.
Credits
Zeglam
Paper Citation
Paper Citation Link 1 here
Paper Citation Link 2 here